Handle failure in publish invocation#83
Handle failure in publish invocation#83mrginglymus wants to merge 3 commits intojenkinsci:masterfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## master #83 +/- ##
============================================
- Coverage 86.94% 86.11% -0.83%
Complexity 139 139
============================================
Files 16 16
Lines 628 634 +6
Branches 48 50 +2
============================================
Hits 546 546
- Misses 65 69 +4
- Partials 17 19 +2
Continue to review full report at Codecov.
|
|
WithChecksCallBack.onFailure calls both WithChecksStepExecution.publish and If WithChecksStepExecution.publish has already called as the onFailure method is part of the FutureCallback interface: https://guava.dev/releases/snapshot-jre/api/docs/com/google/common/util/concurrent/FutureCallback.html#onFailure-java.lang.Throwable- In pipelines, I think the second onFailure call will trigger a warning from CpsStepContext.completed: https://github.com/jenkinsci/workflow-cps-plugin/blob/de5e69b5e85bd5a89960e65f3a634256b24b7af5/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsStepContext.java#L339 That problem is not introduced by this pull request, but perhaps it is made more frequent. I think what should happen is that StepContext.onFailure is called only once, with the A similar violation in onStart looks more difficult to fix, as CpsBodyExecution.launch does not seem to expect BodyExecutionCallback.onStart to invalidate the StepContext already: https://github.com/jenkinsci/workflow-cps-plugin/blob/261df120d1c2f228b2137c3f880a7741293ea5f2/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsBodyExecution.java#L133-L136 |
|
CpsBodyExecution.FailureAdapter.receive has a loop that calls BodyExecutionCallback.onFailure of each callback: https://github.com/jenkinsci/workflow-cps-plugin/blob/261df120d1c2f228b2137c3f880a7741293ea5f2/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsBodyExecution.java#L360-L362 Before that, it has already called setOutcome. Does that mean WithChecksCallBack.onFailure should not call |
32aef2e to
f88c550
Compare
|
There we go; we now only call |
|
checkstyle is failing |
| @Override | ||
| public void stop(final Throwable cause) { | ||
| if (publish(getContext(), new ChecksDetails.ChecksDetailsBuilder() | ||
| Exception publishFailure = publish(getContext(), new ChecksDetails.ChecksDetailsBuilder() |
There was a problem hiding this comment.
this is a bit odd for java, normally you would throw the exception from the publish method and catch it here, possibly a known checked exception
XiongKezhi
left a comment
There was a problem hiding this comment.
have your investigated how other enclosing steps handle such zombie problems
| .build()); | ||
| return null; | ||
| } | ||
| catch (RuntimeException e) { |
There was a problem hiding this comment.
is this catch needed? if u are catching the exception outside this whole function (like in line 112) instead of returning the exception, this is not needed?
|
Let me just take a step back so we can work out what the problem is here. This is the stack trace that I believe to be the cause of my zombie jobs: Whilst apparently innocuous, these zombie jobs are actually quite a bad bug as they can cost a lot of money - I've had elastic agents sitting for three days over the weekend doing nothing because they have been blocked by a zombie job. So:
|
|
sounds reasonable depending on what the code looks like after |
|
Another option - this could well be an issue with GithubBranchSourcePlugin, where handles unmergables, but doesn't. |
|
I think we should at least some determined steps to reproduce the zombie jobs, and the problem may be that our previous implementation deviates from the docs.
Yes, we should make sure of that. In our previous implementation, our
The While I'm scrolling the implementations of body execution to look for a normal pattern (due to the lack of docs specifying the call chain of an interrupted/failed body execution), I notice that many other implementations just implement the tail call: https://github.com/jenkinsci/kubernetes-plugin/blob/a253fba5085eabbb6e0f026f6f5576c7667c31a3/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/ContainerStepExecution.java#L101. It's cleaner, but since our implementation behaves differently on a failure and a cancelation, not sure if this is useful. |
|
@XiongKezhi jenkinsci/workflow-step-api-plugin#60 has a reproduction, although as per the latest comment I need to move it to another repo where the ‘bug’ actually is. otherwise, I think the usage we have here is fine, it’s just that upstream is lacking error handling. Obviously we should do our best not to generate errors too... |
|
there's a checkstyle error in this PR btw in case you missed it |
|
@timja re the checkstyle error - can I just suppress it? The point is I need to patch around the vulnerability of the cps plugin to zombie executions, the only sure-fire way of doing so is ensuring no runtime exceptions make it out of |
yeah sure |
|
Closing in favour of #91 |
From what I think I can glean from the stack trace/reading similar code, an unhandled exception in any of the
BodyExecutionCallbackmight lead to zombie jobs on executors. This manifested itself when thepublishat the start of awithChecksstep failed because of a merge conflict. The overall job aborted succesfully, but the three outstanding parallel stages continued to block their executors and required manual cleanup.